AITopics | expression comprehension

Prior research in human-centric AI has primarily addressed single-modality tasks like pedestrian detection, action recognition, and pose estimation. However, the emergence of large multimodal models (LMMs) such as GPT-4V has redirected attention towards integrating language with visual content. Referring expression comprehension (REC) represents a prime example of this multimodal approach.

artificial intelligence, name change, proceedings, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (0.59)

Add feedback

Generalised Medical Phrase Grounding

Zhang, Wenjun, Chandra, Shekhar S., Nicolson, Aaron

arXiv.org Artificial IntelligenceDec-11-2025

Medical phrase grounding (MPG) maps textual descriptions of radiological findings to corresponding image regions. These grounded reports are easier to interpret, especially for non-experts. Existing MPG systems mostly follow the referring expression comprehension (REC) paradigm and return exactly one bounding box per phrase. Real reports often violate this assumption. They contain multi-region findings, non-diagnostic text, and non-groundable phrases, such as negations or descriptions of normal anatomy. Motivated by this, we reformulate the task as generalised medical phrase grounding (GMPG), where each sentence is mapped to zero, one, or multiple scored regions. To realise this formulation, we introduce the first GMPG model: MedGrounder. We adopted a two-stage training regime: pre-training on report sentence--anatomy box alignment datasets and fine-tuning on report sentence--human annotated box datasets. Experiments on PadChest-GR and MS-CXR show that MedGrounder achieves strong zero-shot transfer and outperforms REC-style and grounded report generation baselines on multi-region and non-groundable phrases, while using far fewer human box annotations. Finally, we show that MedGrounder can be composed with existing report generators to produce grounded reports without retraining the generator.

large language model, machine learning, medgrounder, (19 more...)

arXiv.org Artificial Intelligence

2512.01085

Country:

Oceania > Australia > Queensland > Brisbane (0.04)
North America > United States (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Nuclear Medicine (0.90)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Embodied Referring Expression Comprehension in Human-Robot Interaction

Islam, Md Mofijul, Gladstone, Alexi, Sarker, Sujan, Nanduru, Ganesh, Fahim, Md, Du, Keyan, Chadha, Aman, Iqbal, Tariq

arXiv.org Artificial IntelligenceDec-9-2025

As robots enter human workspaces, there is a crucial need for them to comprehend embodied human instructions, enabling intuitive and fluent human-robot interaction (HRI). However, accurate comprehension is challenging due to a lack of large-scale datasets that capture natural embodied interactions in diverse HRI settings. Existing datasets suffer from perspective bias, single-view collection, inadequate coverage of nonverbal gestures, and a predominant focus on indoor environments. To address these issues, we present the Refer360 dataset, a large-scale dataset of embodied verbal and nonverbal interactions collected across diverse viewpoints in both indoor and outdoor settings. Additionally, we introduce MuRes, a multimodal guided residual module designed to improve embodied referring expression comprehension. MuRes acts as an information bottleneck, extracting salient modality-specific signals and reinforcing them into pre-trained representations to form complementary features for downstream tasks. We conduct extensive experiments on four HRI datasets, including the Refer360 dataset, and demonstrate that current multimodal models fail to capture embodied interactions comprehensively; however, augmenting them with MuRes consistently improves performance. These findings establish Refer360 as a valuable benchmark and exhibit the potential of guided residual learning to advance embodied referring expression comprehension in robots operating within human environments.

artificial intelligence, machine learning, representation, (17 more...)

arXiv.org Artificial Intelligence

2512.06558

Country:

Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.05)
North America > United States > Virginia (0.04)
Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
(4 more...)

Genre: Research Report > New Finding (0.93)

Industry:

Materials (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Human Computer Interaction (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.62)

Add feedback

Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

Shao, Juexi, Li, Siyou, Gan, Yujian, Madge, Chris, Karan, Vanja, Poesio, Massimo

arXiv.org Artificial IntelligenceDec-3-2025

ABSTRACT Dialogue-Based Generalized Referring Expressions Comprehension (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics. Index T erms-- Visual Grounding, Referring Expression Comprehension, Generalized Referring Expression Comprehension, Coreference, Data Synthesis 1. INTRODUCTION Referring Expression Comprehension (REC) - the task of locating a target referred to by a natural language description.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2512.02791

Country: Europe > Austria > Vienna (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Add feedback

Zero-Shot Referring Expression Comprehension via Vison-Language True/False Verification

Liu, Jeffrey, Hu, Rongbin

arXiv.org Artificial IntelligenceNov-14-2025

Referring Expression Comprehension (REC) is usually addressed with task-trained grounding models. We show that a zero-shot workflow, without any REC-specific training, can achieve competitive or superior performance. Our approach reformulates REC as box-wise visual-language verification: given proposals from a COCO-clean generic detector (YOLO-World), a general-purpose VLM independently answers True/False queries for each region. This simple procedure reduces cross-box interference, supports abstention and multiple matches, and requires no fine-tuning. On RefCOCO, RefCOCO+, and RefCOCOg, our method not only surpasses a zero-shot GroundingDINO baseline but also exceeds reported results for GroundingDINO trained on REC and GroundingDINO+CRG. Controlled studies with identical proposals confirm that verification significantly outperforms selection-based prompting, and results hold with open VLMs. Overall, we show that workflow design, rather than task-specific pretraining, drives strong zero-shot REC performance.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2509.09958

Country:

North America > United States > Tennessee > Davidson County > Nashville (0.04)
North America > United States > California > San Francisco County > San Francisco (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

8644353f7d307baaf29bc1e56fe8e0ec-Paper-Datasets_and_Benchmarks.pdf

Neural Information Processing SystemsOct-9-2025, 00:24:26 GMT

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > China > Hubei Province > Wuhan (0.04)
Asia > China > Heilongjiang Province > Daqing (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Robots (0.93)
(2 more...)

Add feedback

GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions

Liu, Bing, Yv, Wenqiang, Yang, Xuzheng, Wang, Shichang, Liu, Junzhuo, Wang, Peng, Wang, Guoqing, Yang, Yang, Shen, Heng Tao

arXiv.org Artificial IntelligenceSep-26-2025

Abstract--AI-driven geometric problem solving is a complex vision-language task that requires accurate diagram interpretation, mathematical reasoning, and robust cross-modal grounding. A foundational yet underexplored capability for this task is the ability to identify and interpret geometric elements based on natural language queries. T o address this, we introduce the task of Referring Expression Comprehension (REC) for geometric problems, which evaluates whether models can localize points, shapes, and spatial relations in diagrams in response to textual prompts. We present GeoRef, a benchmark dataset constructed from existing geometric problem corpora, featuring diverse, high-quality annotations and queries. Due to the lack of annotated data for this task, we generate a large-scale synthetic training dataset using a structured geometric formal language, enabling broad coverage of geometric concepts and facilitating model adaptation. We explore two fine-tuning approaches: Supervised Fine-T uning (SFT) and Group Relative Policy Optimization (GRPO). Our results show that GRPO significantly outperforms SFT by better aligning model behavior with task-specific rewards. Furthermore, we propose a verify-and-regenerate mechanism that detects incorrect predictions and re-infers answers using contextual reasoning history, further boosting accuracy. Moreover, models trained on GeoRef demonstrate measurable improvements on downstream geometric reasoning tasks, highlighting the broader value of REC as a foundation for multimodal mathematical understanding. I for geometric problem solving presents a unique challenge at the intersection of vision and language, requiring not only logical reasoning but also precise diagram interpretation, spatial understanding, and cross-modal grounding.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2509.2105

Country: Asia > China (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

Dai, Ming, Cheng, Wenxuan, Zhuang, Jiedong, Liu, Jiang-jiang, Zhao, Hongshen, Feng, Zhenhua, Yang, Wankou

arXiv.org Artificial IntelligenceSep-8-2025

Recent advances in visual grounding have largely shifted away from traditional proposal-based two-stage frameworks due to their inefficiency and high computational complexity, favoring end-to-end direct reference paradigms. However, these methods rely exclusively on the referred target for supervision, overlooking the potential benefits of prominent prospective targets. Moreover, existing approaches often fail to incorporate multi-granularity discrimination, which is crucial for robust object identification in complex scenarios. To address these limitations, we propose PropVG, an end-to-end proposal-based framework that, to the best of our knowledge, is the first to seamlessly integrate foreground object proposal generation with referential object comprehension without requiring additional detectors. Furthermore, we introduce a Contrastive-based Refer Scoring (CRS) module, which employs contrastive learning at both sentence and word levels to enhance the capability in understanding and distinguishing referred objects. Additionally, we design a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve the recognition of absent targets. Extensive experiments on gRefCOCO (GREC/GRES), Ref-ZOM, R-RefCOCO, and RefCOCO (REC/RES) benchmarks demonstrate the effectiveness of PropVG. The codes and models are available at https://github.com/Dmmm1997/PropVG.

large language model, machine learning, segmentation, (17 more...)

arXiv.org Artificial Intelligence

2509.04833

Country: Asia > China > Heilongjiang Province > Daqing (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Add feedback

Collaborating Authors

expression comprehension

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

8644353f7d307baaf29bc1e56fe8e0ec-Paper-Datasets_and_Benchmarks.pdf

a376802c0811f1b9088828288eb0d3f0-Paper.pdf

A Large-Scale Human-Centric Benchmark for Referring Expression Comprehension in the LMM Era

Generalised Medical Phrase Grounding

Embodied Referring Expression Comprehension in Human-Robot Interaction

Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

Zero-Shot Referring Expression Comprehension via Vison-Language True/False Verification

8644353f7d307baaf29bc1e56fe8e0ec-Paper-Datasets_and_Benchmarks.pdf

GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions

PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination